Introduction to R: first steps, statistics, graphics
@FedeBioinfo

This document contains a comprehensive workflow for the course Introduction to R: first steps, statistics, graphics, offered in the context of the CTH Course Series on Statistics - 2024.
The content closely follows the slides used during the course itself, with particular emphasis on the code and its output. This should allow you to replicate all the steps, providing a starting point for extending the commands, e.g. to your own datasets.
Schedule (tentative)
9:00-10:00 — Step 0 & Bonus: How to make the most of this
10:00-12:30 — Step 1, 2, 3 (with short breaks in between)
12:30-13:30 — Lunch break
13:30-15:30 — Step 3, 4, Q&As
Let’s get to know each other a bit…
Available at https://www.r-project.org/
What can you do with R?
Why should I use R?
Why should I not use R?
Get R - and RStudio
Alternatives:
Open up RStudio - You’ll have four panes
Want to customize this?
Tools \(\rightarrow\) Global Options \(\rightarrow\) Pane Layout
Advantages of an IDE
It is good practice to keep a set of related data, analyses, and text self-contained in a single folder, called the working directory.
Why?
How?
RStudio projects!
Custom settings, per project.
Let’s create one live now - and have the workspace NOT saved
What is the best structure?
One that is used consistently - I won't touch on naming things here, as that debate can get heated quickly :)
With R…
dir.create()
file.edit()
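The two functions above can be combined to set up a project skeleton directly from R. A minimal sketch, using a temporary directory so it runs anywhere; the folder names ("data", "scripts", "output") are just one common convention, not a requirement.

```r
# Create a small project folder structure from R
root <- tempdir()  # in a real project, this would be your project folder
for (sub in c("data", "scripts", "output")) {
  dir.create(file.path(root, sub), showWarnings = FALSE)
}
list.files(root)
# file.edit("scripts/analysis.R") would open (and create, if needed)
# a script file in the RStudio editor
```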
Where am I doing things?
The working directory is the place from where R will be looking for and saving the files.
getwd()/setwd(), but not in your scripts (fails on others’ computers!)
Instructions, commands.
Scripts, console - use the editor and have a complete record of what you did!
Shortcuts FTW!
Even better: Reproducible documents, with R Markdown
Nice resources on top:
* the RStudio cheatsheet about the RStudio IDE!
* the internet/#rstats community!
The main point: describe your problem well, "help others help you"
Others need to reproduce your error to help you better: saveRDS(), dput(), sessionInfo()
R packages…
Repositories:
Our contributions so far:
* https://bioconductor.org/packages/flowcatchR/
* https://bioconductor.org/packages/pcaExplorer/
* https://bioconductor.org/packages/ideal/
* https://bioconductor.org/packages/GeneTonic
* https://bioconductor.org/packages/iSEE

For Bioconductor packages…
install.packages("BiocManager")
library("BiocManager")
BiocManager::install()
Relevant commands:
* install.packages("packagename") - check it online at CRAN!
* installed.packages()
* .libPaths()
* update.packages()
* library("packagename")
* help(package="packagename"), data(), browseVignettes(), vignette(), citation("packagename")

Something you might have already done, if you worked on RNA-seq data:
BiocManager::install("SummarizedExperiment")
BiocManager::install("DESeq2")
Use this very same text document and expand this!
Why? Let's first have a look at the Bonus Content: we navigate to Section 5 and continue from there.
Here we will touch on the first commands in R, so that you can
… but not just that.
Type the following
2 + 2
# [1] 4
log(2)
# [1] 0.6931472
347 * 73841
# [1] 25622827
7 / 2
# [1] 3.5
7 %/% 2
# [1] 3
7 %% 2
# [1] 1
What did these commands do?
# this calls the help for a function to plot a histogram
?hist
# this is just the same
help(hist) ## what about ??
?apropos
apropos("row")
# [1] ".row" ".rowMeans" ".rowNamesDF<-"
# [4] ".rowSums" "arrows" "browseEnv"
# [7] "browser" "browserCondition" "browserSetDebug"
# [10] "browserText" "browseURL" "browseVignettes"
# [13] "n2mfrow" "nrow" "NROW"
# [16] "PlantGrowth" "row" "row.names"
# [19] "row.names.data.frame" "row.names.default" "row.names<-"
# [22] "row.names<-.data.frame" "row.names<-.default" "rowMeans"
# [25] "rownames" "rownames<-" "rowsum"
# [28] "rowsum.data.frame" "rowsum.default" "rowSums"
# [31] "ToothGrowth" "xpdrows.data.frame"
* getwd() and setwd() - Tab is your friend
* <-, =: the assignment operator
* ls(), rm()
* str()
* example(), help()/?[function]
* print()
* q()/quit()
* TRUE, FALSE, !, ==, !=, <, >, <=, >=, |, &, xor()
* c()
* cars

Find out what these do!
?getwd
?setwd
?`<-`
help(ls)
help(rm)
?str
?example
help(help)
?print
help(quit)
?c
?cars
# This is a comment
Careful here:
First things first:
Then:
* the iris dataset. What is it about at all? How many variables are included? How many observations?
* replicate! Find out a function that replicates elements of a vector to produce this

1 1 1 1
BONUS: … and this
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(1, 4)
# [1] 1 1 1 1
rep(c(1, 2, 3, 4, 5), 3)
# [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
R can recognize different general types of data
numbers (numeric)
character strings (text)
logical (e.g. class(TRUE))
factors (“integers with a set of labels”) - it is categorical data!
special ones: dates, time, …
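A quick way to see how R classifies these types is to ask for the class() of a few literal values, as in this short sketch:

```r
# Inspect the class of each basic data type mentioned above
class(3.14)                             # "numeric"
class("hello")                          # "character"
class(TRUE)                             # "logical"
class(factor(c("low", "high", "low")))  # "factor" - categorical data
class(Sys.Date())                       # "Date" - one of the special ones
```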
When and where to use which?
In a previous edition of a similar course, I asked the future participants:
80-20? 90-10? Import, clean, prepare, transform your data
Sources:
* haven
* readxl - highly recommended!

… and exporting
* read.table(), write.table() + read.csv|delim; stringsAsFactors=FALSE
* load(), save() / readRDS(), saveRDS()
* haven: read_sas(), read_spss() / write_sas(), write_sav()
* readxl: read_excel()

Check out their documentation pages!
Other options: rio, RStudio GUI
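A minimal round trip with the base functions listed above; it uses a temporary file so the sketch runs anywhere, and the small data.frame is made up for illustration.

```r
# Write a data.frame to csv and read it back in
tmp <- tempfile(fileext = ".csv")
df <- data.frame(id = 1:3, group = c("a", "b", "a"))
write.csv(df, file = tmp, row.names = FALSE)
df_back <- read.csv(tmp, stringsAsFactors = FALSE)
identical(df$group, df_back$group)  # TRUE - the text column survived intact
```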
Go to https://github.com/federicomarini/rbioc2016
\(\rightarrow\) inst/extdata
\(\rightarrow\) survey_responses.csv, in its raw format
You can load it directly like this
surveyrbioc <- read.csv("https://raw.githubusercontent.com/federicomarini/rbioc2016/master/inst/extdata/survey_responses.csv")
Or install the package and load it from there
library("devtools")
install_github("federicomarini/rbioc2016")
library("rbioc2016")
data(surveyrbioc)
Sometimes your data is either small and/or not in an Excel-like tabular format.
What to do? You combine the elements together!
Q1 <- c(28, 27, 33, 32, 29)
# should return this
Q1
# [1] 28 27 33 32 29
Q2 <- c("PhD student", "PhD student", "Postdoc", "PhD student", "PhD student")
Q2
# [1] "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"
# ... and so on
We have seen c(). We also have
* cbind
* rbind

firstTwo <- cbind(Q1, Q2)
firstTwo
# Q1 Q2
# [1,] "28" "PhD student"
# [2,] "27" "PhD student"
# [3,] "33" "Postdoc"
# [4,] "32" "PhD student"
# [5,] "29" "PhD student"
rbind(Q1, Q2)
# [,1] [,2] [,3] [,4] [,5]
# Q1 "28" "27" "33" "32" "29"
# Q2 "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"
Is this what you wanted?
But first, what can you do on these objects?
sum(Q1)
# [1] 149
sum(Q2)
# Error in sum(Q2): invalid 'type' (character) of argument
summary(Q1)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 27.0 28.0 29.0 29.8 32.0 33.0
summary(Q2)
# Length Class Mode
# 5 character character
str(Q1)
# num [1:5] 28 27 33 32 29
str(Q2)
# chr [1:5] "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"
mean(Q1)
# [1] 29.8
dim(firstTwo)
# [1] 5 2
firstTwo[, 1]
# [1] "28" "27" "33" "32" "29"
mean(firstTwo[, 1]) # Why, damn, why? Meet coercion
# [1] NA
class(firstTwo)
# [1] "matrix" "array"
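The fix for the coercion above: since cbind() on mixed types produced a character matrix, convert the column back to numeric before computing. A self-contained sketch, re-creating the vectors from earlier:

```r
# cbind() coerces everything to character, so convert back before computing
Q1 <- c(28, 27, 33, 32, 29)
Q2 <- c("PhD student", "PhD student", "Postdoc", "PhD student", "PhD student")
firstTwo <- cbind(Q1, Q2)
mean(as.numeric(firstTwo[, 1]))
# [1] 29.8
```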
matrix, data.frame and list:
* matrix can contain one type of data - if numeric, you unleash all the matrix algebra power!
* data.frame can store more types of data (one per column)
* list is like a big box where you can put anything - but this is not always what you want

What is best?
Let’s try with a list
Q3 <- c("intermediate", "poor", "good", "none", "intermediate")
mylist <- list(Q1, Q2, Q3)
mylist
# [[1]]
# [1] 28 27 33 32 29
#
# [[2]]
# [1] "PhD student" "PhD student" "Postdoc" "PhD student" "PhD student"
#
# [[3]]
# [1] "intermediate" "poor" "good" "none" "intermediate"
## access your elements with
mylist[[1]]
# [1] 28 27 33 32 29
mylist[[1]][2]
# [1] 27
How do we create a data.frame?
mydf <- data.frame(
age = Q1,
level = Q2,
rexp = Q3
)
mydf
# age level rexp
# 1 28 PhD student intermediate
# 2 27 PhD student poor
# 3 33 Postdoc good
# 4 32 PhD student none
# 5 29 PhD student intermediate
class(mydf$age)
# [1] "numeric"
data.frame

mydf$age # it's all about the money :)
# [1] 28 27 33 32 29
mydf[, 1]
# [1] 28 27 33 32 29
names(mydf)
# [1] "age" "level" "rexp"
rownames(mydf)
# [1] "1" "2" "3" "4" "5"
dim(mydf)
# [1] 5 3
nrow(mydf)
# [1] 5
ncol(mydf)
# [1] 3
surveyrbioc <- read.csv("https://raw.githubusercontent.com/federicomarini/rbioc2016/master/inst/extdata/survey_responses.csv")
head(surveyrbioc)
# Q1 Q2 Q3 Q4 Q5 Q6
# 1 28 PhD student intermediate intermediate good I am a regular user
# 2 27 PhD student poor intermediate poor I know it exists
# 3 33 Postdoc good good good I know it exists
# 4 32 PhD student poor poor poor Is this supposed to be in the cloud?!
# 5 29 PhD student none poor poor I know it exists
# 6 33 Postdoc intermediate intermediate intermediate Once in a while i used it
# Q7
# 1 Learn Parallelization with R (and Bioconductor)
# 2 <NA>
# 3 use R scripts in parallel context (ex: alignments in RNA-seq)
# 4 Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would be interesting to get to now how to find putative novel miRNAs.
# 5 Possible I will use R for editing RNA-seq data
# 6 I would describe myself as an advanced beginner, I am actively using R now to look (mostly) at count tables from NGS data and make plots. I am especially interested in the Bioconductor and parallelization in R section, this is new for me.
tail(surveyrbioc)
# Q1 Q2 Q3 Q4 Q5
# 17 28 Master student/else intermediate intermediate intermediate
# 18 39 Postdoc good intermediate good
# 19 32 Master student/else none poor genoWhat?
# 20 29 PhD student poor none poor
# 21 31 PhD student none none good
# 22 30 PhD student none none good
# Q6 Q7
# 17 I know it exists better understanding of R and to extend my knowledge
# 18 I am a regular user Mainly interested in parallel computing options
# 19 Is this supposed to be in the cloud?! <NA>
# 20 Is this supposed to be in the cloud?! To better understand R and perform basic analysis
# 21 I heard we had some servers around working with fastq files based on R
# 22 Is this supposed to be in the cloud?! I want to be able to analyze my sequence data on my own :P
names(surveyrbioc)
# [1] "Q1" "Q2" "Q3" "Q4" "Q5" "Q6" "Q7"
str(surveyrbioc)
# 'data.frame': 22 obs. of 7 variables:
# $ Q1: int 28 27 33 32 29 33 40 23 27 23 ...
# $ Q2: chr "PhD student" "PhD student" "Postdoc" "PhD student" ...
# $ Q3: chr "intermediate" "poor" "good" "poor" ...
# $ Q4: chr "intermediate" "intermediate" "good" "poor" ...
# $ Q5: chr "good" "poor" "good" "poor" ...
# $ Q6: chr "I am a regular user" "I know it exists" "I know it exists" "Is this supposed to be in the cloud?!" ...
# $ Q7: chr "Learn Parallelization with R (and Bioconductor)" NA "use R scripts in parallel context (ex: alignments in RNA-seq)" "Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would"| __truncated__ ...
summary(surveyrbioc)
# Q1 Q2 Q3 Q4 Q5
# Min. :23.00 Length:22 Length:22 Length:22 Length:22
# 1st Qu.:27.25 Class :character Class :character Class :character Class :character
# Median :29.50 Mode :character Mode :character Mode :character Mode :character
# Mean :30.14
# 3rd Qu.:32.75
# Max. :40.00
# Q6 Q7
# Length:22 Length:22
# Class :character Class :character
# Mode :character Mode :character
#
#
#
surveyrbioc[, ]
# Q1 Q2 Q3 Q4 Q5
# 1 28 PhD student intermediate intermediate good
# 2 27 PhD student poor intermediate poor
# 3 33 Postdoc good good good
# 4 32 PhD student poor poor poor
# 5 29 PhD student none poor poor
# 6 33 Postdoc intermediate intermediate intermediate
# 7 40 Postdoc good good intermediate
# 8 23 Master student/else poor poor good
# 9 27 PhD student none poor intermediate
# 10 23 Master student/else poor poor poor
# 11 35 Postdoc poor poor intermediate
# 12 34 Master student/else poor pro poor
# 13 31 Postdoc none none intermediate
# 14 27 PhD student none poor poor
# 15 24 Master student/else poor intermediate intermediate
# 16 28 PhD student none poor poor
# 17 28 Master student/else intermediate intermediate intermediate
# 18 39 Postdoc good intermediate good
# 19 32 Master student/else none poor genoWhat?
# 20 29 PhD student poor none poor
# 21 31 PhD student none none good
# 22 30 PhD student none none good
# Q6
# 1 I am a regular user
# 2 I know it exists
# 3 I know it exists
# 4 Is this supposed to be in the cloud?!
# 5 I know it exists
# 6 Once in a while i used it
# 7 I am a regular user
# 8 I know it exists
# 9 I know it exists
# 10 Is this supposed to be in the cloud?!
# 11 I know it exists
# 12 I know it exists
# 13 Is this supposed to be in the cloud?!
# 14 I know it exists
# 15 I know it exists
# 16 I know it exists
# 17 I know it exists
# 18 I am a regular user
# 19 Is this supposed to be in the cloud?!
# 20 Is this supposed to be in the cloud?!
# 21 I heard we had some servers around
# 22 Is this supposed to be in the cloud?!
# Q7
# 1 Learn Parallelization with R (and Bioconductor)
# 2 <NA>
# 3 use R scripts in parallel context (ex: alignments in RNA-seq)
# 4 Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would be interesting to get to now how to find putative novel miRNAs.
# 5 Possible I will use R for editing RNA-seq data
# 6 I would describe myself as an advanced beginner, I am actively using R now to look (mostly) at count tables from NGS data and make plots. I am especially interested in the Bioconductor and parallelization in R section, this is new for me.
# 7 <NA>
# 8 Get to know R better, basic commands
# 9 To look if I can process my sequencing data on my own using R
# 10 I would like to learn more about R, especially how to use it in biomedial research. Im in the 2nd Semester of the Rasterprogramm biomedicine, last Semester I had two weeks bioinformatics and we used to work with R for statistics/ChIP-Seq/microarray-data analysis, but the time was too short, to go deep into it, we just scratched the surface. So now I wish to learn some more, would be great, if I could work with the programme by myself, for example for the masterthesis.
# 11 learn more about R
# 12 Brush up on some R knowledge and maybe get some different perspective on Genome processing
# 13 to get a good and understandable introduction into R programming and bioinformatics
# 14 learning how to work with R
# 15 To learn the basics of R programming
# 16 <NA>
# 17 better understanding of R and to extend my knowledge
# 18 Mainly interested in parallel computing options
# 19 <NA>
# 20 To better understand R and perform basic analysis
# 21 working with fastq files based on R
# 22 I want to be able to analyze my sequence data on my own :P
Don't forget some nice built-ins like the View() function in RStudio - a nice alternative to "pure browsing in Excel".
Using the surveyrbioc object:
* compute the mean of the age, and its maximum (max can be your help)
* transpose the survey data and assign it to another variable

mean(surveyrbioc$Q1)
# [1] 30.13636
max(surveyrbioc$Q1)
# [1] 40
my_transposed_survey <- t(surveyrbioc)
surveyrbioc_mod <- surveyrbioc
colnames(surveyrbioc_mod) <- c("age", "level", "rlevel", "prog_level", "genomics_level", "parcomp_level", "expectation")
surveyrbioc_mod$expectation[which.min(surveyrbioc_mod$age)]
# [1] "Get to know R better, basic commands"
Describe, explore, transform, summarise data
* dim(x) shows the dimensions of an object
* str(x) provides an overview of the structure of an object and the elements it contains
* sum(x), mean(x), sd(x) compute the sum, mean, or standard deviation of all the elements in x; median(x), quantile(x)
* length(x) returns the number of elements in x (a vector)
* sqrt(x), log(x) take the square root and the natural logarithm of a numeric - element or vector
* hist(x, breaks=20, col="blue") plots a histogram of variable x with 20 bins colored blue
* unique(x) returns the vector of unique elements in x
* rm(x) removes the object x from the environment (rm(list=ls()) removes all objects)
* sessionInfo() prints information about the R session and versions of all attached packages

This is the basic way it works
surveyrbioc[ROWS, COLUMNS]
You can subset with…
Try to make a guess, given this vector.
vec <- c(6, 1, 3, 6, 10, 5)
What happens if you do this?
vec[2]
# [1] 1
vec[c(5, 6)]
# [1] 10 5
vec[-c(5, 6)]
# [1] 6 1 3 6
vec > 5
# [1] TRUE FALSE FALSE TRUE TRUE FALSE
vec[vec > 5]
# [1] 6 6 10
What happens if you do this?
df <- data.frame(
name = c("John", "Paul", "George", "Ringo"),
birth = c(1940, 1942, 1943, 1940),
instrument = c("guitar", "bass", "guitar", "drums")
)
df
# name birth instrument
# 1 John 1940 guitar
# 2 Paul 1942 bass
# 3 George 1943 guitar
# 4 Ringo 1940 drums
df[c(2, 4), 3]
# [1] "bass" "drums"
df[, 1]
# [1] "John" "Paul" "George" "Ringo"
df[, "instrument"]
# [1] "guitar" "bass" "guitar" "drums"
df$instrument
# [1] "guitar" "bass" "guitar" "drums"
Back to the survey
# I just want the age
surveyrbioc[, 1]
# [1] 28 27 33 32 29 33 40 23 27 23 35 34 31 27 24 28 28 39 32 29 31 30
# or
surveyrbioc$Q1
# [1] 28 27 33 32 29 33 40 23 27 23 35 34 31 27 24 28 28 39 32 29 31 30
# the first 4 columns
surveyrbioc[, c(1, 2, 3, 4)]
# Q1 Q2 Q3 Q4
# 1 28 PhD student intermediate intermediate
# 2 27 PhD student poor intermediate
# 3 33 Postdoc good good
# 4 32 PhD student poor poor
# 5 29 PhD student none poor
# 6 33 Postdoc intermediate intermediate
# 7 40 Postdoc good good
# 8 23 Master student/else poor poor
# 9 27 PhD student none poor
# 10 23 Master student/else poor poor
# 11 35 Postdoc poor poor
# 12 34 Master student/else poor pro
# 13 31 Postdoc none none
# 14 27 PhD student none poor
# 15 24 Master student/else poor intermediate
# 16 28 PhD student none poor
# 17 28 Master student/else intermediate intermediate
# 18 39 Postdoc good intermediate
# 19 32 Master student/else none poor
# 20 29 PhD student poor none
# 21 31 PhD student none none
# 22 30 PhD student none none
surveyrbioc[, 1:4]
# Q1 Q2 Q3 Q4
# 1 28 PhD student intermediate intermediate
# 2 27 PhD student poor intermediate
# 3 33 Postdoc good good
# 4 32 PhD student poor poor
# 5 29 PhD student none poor
# 6 33 Postdoc intermediate intermediate
# 7 40 Postdoc good good
# 8 23 Master student/else poor poor
# 9 27 PhD student none poor
# 10 23 Master student/else poor poor
# 11 35 Postdoc poor poor
# 12 34 Master student/else poor pro
# 13 31 Postdoc none none
# 14 27 PhD student none poor
# 15 24 Master student/else poor intermediate
# 16 28 PhD student none poor
# 17 28 Master student/else intermediate intermediate
# 18 39 Postdoc good intermediate
# 19 32 Master student/else none poor
# 20 29 PhD student poor none
# 21 31 PhD student none none
# 22 30 PhD student none none
# all but the last column
surveyrbioc[, -7]
# Q1 Q2 Q3 Q4 Q5
# 1 28 PhD student intermediate intermediate good
# 2 27 PhD student poor intermediate poor
# 3 33 Postdoc good good good
# 4 32 PhD student poor poor poor
# 5 29 PhD student none poor poor
# 6 33 Postdoc intermediate intermediate intermediate
# 7 40 Postdoc good good intermediate
# 8 23 Master student/else poor poor good
# 9 27 PhD student none poor intermediate
# 10 23 Master student/else poor poor poor
# 11 35 Postdoc poor poor intermediate
# 12 34 Master student/else poor pro poor
# 13 31 Postdoc none none intermediate
# 14 27 PhD student none poor poor
# 15 24 Master student/else poor intermediate intermediate
# 16 28 PhD student none poor poor
# 17 28 Master student/else intermediate intermediate intermediate
# 18 39 Postdoc good intermediate good
# 19 32 Master student/else none poor genoWhat?
# 20 29 PhD student poor none poor
# 21 31 PhD student none none good
# 22 30 PhD student none none good
# Q6
# 1 I am a regular user
# 2 I know it exists
# 3 I know it exists
# 4 Is this supposed to be in the cloud?!
# 5 I know it exists
# 6 Once in a while i used it
# 7 I am a regular user
# 8 I know it exists
# 9 I know it exists
# 10 Is this supposed to be in the cloud?!
# 11 I know it exists
# 12 I know it exists
# 13 Is this supposed to be in the cloud?!
# 14 I know it exists
# 15 I know it exists
# 16 I know it exists
# 17 I know it exists
# 18 I am a regular user
# 19 Is this supposed to be in the cloud?!
# 20 Is this supposed to be in the cloud?!
# 21 I heard we had some servers around
# 22 Is this supposed to be in the cloud?!
# if you don't know we had 7 columns...
surveyrbioc[, -ncol(surveyrbioc)]
# Q1 Q2 Q3 Q4 Q5
# 1 28 PhD student intermediate intermediate good
# 2 27 PhD student poor intermediate poor
# 3 33 Postdoc good good good
# 4 32 PhD student poor poor poor
# 5 29 PhD student none poor poor
# 6 33 Postdoc intermediate intermediate intermediate
# 7 40 Postdoc good good intermediate
# 8 23 Master student/else poor poor good
# 9 27 PhD student none poor intermediate
# 10 23 Master student/else poor poor poor
# 11 35 Postdoc poor poor intermediate
# 12 34 Master student/else poor pro poor
# 13 31 Postdoc none none intermediate
# 14 27 PhD student none poor poor
# 15 24 Master student/else poor intermediate intermediate
# 16 28 PhD student none poor poor
# 17 28 Master student/else intermediate intermediate intermediate
# 18 39 Postdoc good intermediate good
# 19 32 Master student/else none poor genoWhat?
# 20 29 PhD student poor none poor
# 21 31 PhD student none none good
# 22 30 PhD student none none good
# Q6
# 1 I am a regular user
# 2 I know it exists
# 3 I know it exists
# 4 Is this supposed to be in the cloud?!
# 5 I know it exists
# 6 Once in a while i used it
# 7 I am a regular user
# 8 I know it exists
# 9 I know it exists
# 10 Is this supposed to be in the cloud?!
# 11 I know it exists
# 12 I know it exists
# 13 Is this supposed to be in the cloud?!
# 14 I know it exists
# 15 I know it exists
# 16 I know it exists
# 17 I know it exists
# 18 I am a regular user
# 19 Is this supposed to be in the cloud?!
# 20 Is this supposed to be in the cloud?!
# 21 I heard we had some servers around
# 22 Is this supposed to be in the cloud?!
# you can subset with logical vectors, by row and by column
surveyrbioc[c(rep(TRUE, 10), rep(FALSE, 8)), ]
# Q1 Q2 Q3 Q4 Q5
# 1 28 PhD student intermediate intermediate good
# 2 27 PhD student poor intermediate poor
# 3 33 Postdoc good good good
# 4 32 PhD student poor poor poor
# 5 29 PhD student none poor poor
# 6 33 Postdoc intermediate intermediate intermediate
# 7 40 Postdoc good good intermediate
# 8 23 Master student/else poor poor good
# 9 27 PhD student none poor intermediate
# 10 23 Master student/else poor poor poor
# 19 32 Master student/else none poor genoWhat?
# 20 29 PhD student poor none poor
# 21 31 PhD student none none good
# 22 30 PhD student none none good
# Q6
# 1 I am a regular user
# 2 I know it exists
# 3 I know it exists
# 4 Is this supposed to be in the cloud?!
# 5 I know it exists
# 6 Once in a while i used it
# 7 I am a regular user
# 8 I know it exists
# 9 I know it exists
# 10 Is this supposed to be in the cloud?!
# 19 Is this supposed to be in the cloud?!
# 20 Is this supposed to be in the cloud?!
# 21 I heard we had some servers around
# 22 Is this supposed to be in the cloud?!
# Q7
# 1 Learn Parallelization with R (and Bioconductor)
# 2 <NA>
# 3 use R scripts in parallel context (ex: alignments in RNA-seq)
# 4 Id like to gather practise to analyse count data and develope the necessary data design. Additionally, it would be interesting to get to now how to find putative novel miRNAs.
# 5 Possible I will use R for editing RNA-seq data
# 6 I would describe myself as an advanced beginner, I am actively using R now to look (mostly) at count tables from NGS data and make plots. I am especially interested in the Bioconductor and parallelization in R section, this is new for me.
# 7 <NA>
# 8 Get to know R better, basic commands
# 9 To look if I can process my sequencing data on my own using R
# 10 I would like to learn more about R, especially how to use it in biomedial research. Im in the 2nd Semester of the Rasterprogramm biomedicine, last Semester I had two weeks bioinformatics and we used to work with R for statistics/ChIP-Seq/microarray-data analysis, but the time was too short, to go deep into it, we just scratched the surface. So now I wish to learn some more, would be great, if I could work with the programme by myself, for example for the masterthesis.
# 19 <NA>
# 20 To better understand R and perform basic analysis
# 21 working with fastq files based on R
# 22 I want to be able to analyze my sequence data on my own :P
surveyrbioc[c(TRUE, FALSE), ] # keep in mind this behavior!
# Q1 Q2 Q3 Q4 Q5
# 1 28 PhD student intermediate intermediate good
# 3 33 Postdoc good good good
# 5 29 PhD student none poor poor
# 7 40 Postdoc good good intermediate
# 9 27 PhD student none poor intermediate
# 11 35 Postdoc poor poor intermediate
# 13 31 Postdoc none none intermediate
# 15 24 Master student/else poor intermediate intermediate
# 17 28 Master student/else intermediate intermediate intermediate
# 19 32 Master student/else none poor genoWhat?
# 21 31 PhD student none none good
# Q6
# 1 I am a regular user
# 3 I know it exists
# 5 I know it exists
# 7 I am a regular user
# 9 I know it exists
# 11 I know it exists
# 13 Is this supposed to be in the cloud?!
# 15 I know it exists
# 17 I know it exists
# 19 Is this supposed to be in the cloud?!
# 21 I heard we had some servers around
# Q7
# 1 Learn Parallelization with R (and Bioconductor)
# 3 use R scripts in parallel context (ex: alignments in RNA-seq)
# 5 Possible I will use R for editing RNA-seq data
# 7 <NA>
# 9 To look if I can process my sequencing data on my own using R
# 11 learn more about R
# 13 to get a good and understandable introduction into R programming and bioinformatics
# 15 To learn the basics of R programming
# 17 better understanding of R and to extend my knowledge
# 19 <NA>
# 21 working with fastq files based on R
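The behavior flagged above is recycling: the logical index c(TRUE, FALSE) is shorter than the number of rows, so R repeats it along the whole length, keeping every other row. The same happens with a plain vector, as in this short sketch:

```r
# A logical index shorter than the vector gets recycled:
# c(TRUE, FALSE) becomes TRUE FALSE TRUE FALSE TRUE FALSE
vec <- c(6, 1, 3, 6, 10, 5)
vec[c(TRUE, FALSE)]  # keeps elements 1, 3, 5
# [1] 6 3 10
```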
# guess what this does?
surveyrbioc$Q2 == "PhD student"
# [1] TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
# [17] FALSE FALSE FALSE TRUE TRUE TRUE
sum(surveyrbioc$Q2 == "PhD student")
# [1] 10
mean(surveyrbioc$Q2 == "PhD student")
# [1] 0.4545455
mean(surveyrbioc$Q1[surveyrbioc$Q2 == "PhD student"])
# [1] 28.8
sum(surveyrbioc$Q1 >= 30)
# [1] 11
sum(surveyrbioc$Q1 < 35 & surveyrbioc$Q2 == "Postdoc")
# [1] 3
sum(is.na(surveyrbioc$Q7))
# [1] 4
You can
* sort your data (with sort and order)
* combine datasets (merge them)

myord <- order(surveyrbioc$Q1)
myord
# [1] 8 10 15 2 9 14 1 16 17 5 20 22 13 21 4 19 3 6 12 11 18 7
head(surveyrbioc[myord, 1:5], 4)
# Q1 Q2 Q3 Q4 Q5
# 8 23 Master student/else poor poor good
# 10 23 Master student/else poor poor poor
# 15 24 Master student/else poor intermediate intermediate
# 2 27 PhD student poor intermediate poor
sorted_surv <- surveyrbioc[myord, 1:6]
sort() returns the sorted data; order() returns only the indices
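The relationship between the two functions in one line: subsetting with order() reproduces sort(), which is exactly what makes order() useful for reordering a whole data.frame by one column.

```r
# sort() vs order() on a small vector
x <- c(30, 10, 20)
sort(x)      # [1] 10 20 30
order(x)     # [1] 2 3 1  - the positions of the smallest, middle, largest
x[order(x)]  # same result as sort(x)
```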
# transforming a variable
newsurvey <- surveyrbioc[, 1:5]
newsurvey$ageroot <- sqrt(newsurvey$Q1)
head(newsurvey)
# Q1 Q2 Q3 Q4 Q5 ageroot
# 1 28 PhD student intermediate intermediate good 5.291503
# 2 27 PhD student poor intermediate poor 5.196152
# 3 33 Postdoc good good good 5.744563
# 4 32 PhD student poor poor poor 5.656854
# 5 29 PhD student none poor poor 5.385165
# 6 33 Postdoc intermediate intermediate intermediate 5.744563
# creating groups out of a continuous variable
newsurvey$agegroup <- cut(newsurvey$Q1, breaks = c(20, 30, 40))
head(newsurvey)
# Q1 Q2 Q3 Q4 Q5 ageroot agegroup
# 1 28 PhD student intermediate intermediate good 5.291503 (20,30]
# 2 27 PhD student poor intermediate poor 5.196152 (20,30]
# 3 33 Postdoc good good good 5.744563 (30,40]
# 4 32 PhD student poor poor poor 5.656854 (30,40]
# 5 29 PhD student none poor poor 5.385165 (20,30]
# 6 33 Postdoc intermediate intermediate intermediate 5.744563 (30,40]
Use case for merge: you have two sets you are playing with! Think in advance what you need for that purpose…
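A hypothetical sketch of that merge use case; the data.frames, column names, and values here are made up for illustration. The key point is choosing a shared identifier column in advance.

```r
# Two small data.frames sharing an "id" column
ages <- data.frame(id = c(1, 2, 3), age = c(28, 33, 27))
levels_df <- data.frame(id = c(2, 3, 4),
                        level = c("Postdoc", "PhD student", "PhD student"))

merge(ages, levels_df, by = "id")             # only ids present in both (2 and 3)
merge(ages, levels_df, by = "id", all = TRUE) # keep all ids, filling gaps with NA
```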
Are PhD students significantly younger than postdocs? Are there any differences in the age of the three groups?
phds <- surveyrbioc[surveyrbioc$Q2 == "PhD student", ]
postdocs <- surveyrbioc[surveyrbioc$Q2 == "Postdoc", ]
t.test(phds$Q1, postdocs$Q1)
#
# Welch Two Sample t-test
#
# data: phds$Q1 and postdocs$Q1
# t = -4.0528, df = 6.4476, p-value = 0.005767
# alternative hypothesis: true difference in means is not equal to 0
# 95 percent confidence interval:
# -10.146796 -2.586537
# sample estimates:
# mean of x mean of y
# 28.80000 35.16667
aov(data = surveyrbioc, Q1 ~ Q2) # What is missing here?
# Call:
# aov(formula = Q1 ~ Q2, data = surveyrbioc)
#
# Terms:
# Q2 Residuals
# Sum of Squares 216.8242 207.7667
# Deg. of Freedom 2 19
#
# Residual standard error: 3.306824
# Estimated effects may be unbalanced
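One answer to the "What is missing here?" question above: aov() alone only prints the fitted terms; wrapping the fit in summary() produces the ANOVA table with the F statistic and p-value. Sketched on the built-in PlantGrowth data so it runs without the survey object:

```r
# Fit a one-way ANOVA and extract the actual test results
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)  # gives Df, Sum Sq, F value, and Pr(>F) for the group effect
```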
Much more on this: in the next courses!
tapply
You want to calculate the median age of each academic group in here
md <- median(surveyrbioc$Q1)
md_master <- median(surveyrbioc$Q1[surveyrbioc$Q2 == "Master student/else"])
md_phd <- median(surveyrbioc$Q1[surveyrbioc$Q2 == "PhD student"])
md_postdocs <- median(surveyrbioc$Q1[surveyrbioc$Q2 == "Postdoc"])
c(md_master, md_phd, md_postdocs)
# [1] 26.0 28.5 34.0
tapply splits the first variable by the levels of the second variable, and applies the given function (any function) to each group
tapply(X = surveyrbioc$Q1, INDEX = surveyrbioc$Q2, FUN = median)
# Master student/else PhD student Postdoc
# 26.0 28.5 34.0
lapply and sapply
Back to our iris dataset
names(iris)
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
We want the average sepal length and width, and the same for the petals. Uh, and we want the standard deviation too.
# the inefficient way:
seplen_m <- mean(iris$Sepal.Length)
sepwid_m <- mean(iris$Sepal.Width)
petlen_m <- mean(iris$Petal.Length)
petwid_m <- mean(iris$Petal.Width)
seplen_sd <- sd(iris$Sepal.Length)
# ... and so on
\(\rightarrow\) Apply a Function over a List or Vector
# we will use just the first four columns
lapply(iris[, 1:4], mean)
# $Sepal.Length
# [1] 5.843333
#
# $Sepal.Width
# [1] 3.057333
#
# $Petal.Length
# [1] 3.758
#
# $Petal.Width
# [1] 1.199333
sapply(iris[, 1:4], mean)
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# 5.843333 3.057333 3.758000 1.199333
lapply(iris[, 1:4], sd)
# $Sepal.Length
# [1] 0.8280661
#
# $Sepal.Width
# [1] 0.4358663
#
# $Petal.Length
# [1] 1.765298
#
# $Petal.Width
# [1] 0.7622377
# ...
The major difference is in the presentation of the output
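That difference in a nutshell: lapply() always returns a list, while sapply() simplifies the result to a vector or matrix when it can, as this short check shows:

```r
# Same computation, different output containers
res_l <- lapply(iris[, 1:4], mean)
res_s <- sapply(iris[, 1:4], mean)
class(res_l)  # "list"
class(res_s)  # "numeric" - a named vector
```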
A summary of your data - and some other friends…
Try out summary on a data.frame
summary(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
# 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
# Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
# Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
# 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
# Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Alternatives in other packages:
* describe() in the Hmisc package
* skim() from skimr
* create_report() from DataExplorer

table
table(surveyrbioc$Q3)
#
# good intermediate none poor
# 3 3 8 8
table(surveyrbioc$Q4)
#
# good intermediate none poor pro
# 2 6 4 9 1
table(surveyrbioc$Q2, surveyrbioc$Q3)
#
# good intermediate none poor
# Master student/else 0 1 1 4
# PhD student 0 1 6 3
# Postdoc 3 1 1 1
* addmargins()
* prop.table()
* ftable()

addmargins(table(surveyrbioc$Q2, surveyrbioc$Q3))
#
# good intermediate none poor Sum
# Master student/else 0 1 1 4 6
# PhD student 0 1 6 3 10
# Postdoc 3 1 1 1 6
# Sum 3 3 8 8 22
prop.table(table(surveyrbioc$Q2, surveyrbioc$Q3))
#
# good intermediate none poor
# Master student/else 0.00000000 0.04545455 0.04545455 0.18181818
# PhD student 0.00000000 0.04545455 0.27272727 0.13636364
# Postdoc 0.13636364 0.04545455 0.04545455 0.04545455
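Two things worth knowing on top of the overall proportions above: prop.table() can compute row- or column-wise proportions via its margin argument, and ftable() (mentioned in the list above) flattens higher-dimensional tables. Sketched on built-in datasets so it runs without the survey object:

```r
# Row-wise proportions: each row of the table sums to 1
tab <- table(mtcars$cyl, mtcars$gear)
prop.table(tab, margin = 1)

# A flat, readable view of a 4-way contingency table
ftable(Titanic)
```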
Please always do check the docs!
The MASS package contains the dataset Cars93, which stores the data on 93 makes of car sold in the US
Type specifies the type of market the car is aimed at.
* Find the cheapest car in each type, and the one with the greatest fuel efficiency
* Split the dataset into two data.frames, one for US cars, the other one with non-US cars
* Export them (e.g. as csv or RData)

library(MASS)
head(Cars93)
# Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway AirBags
# 1 Acura Integra Small 12.9 15.9 18.8 25 31 None
# 2 Acura Legend Midsize 29.2 33.9 38.7 18 25 Driver & Passenger
# 3 Audi 90 Compact 25.9 29.1 32.3 20 26 Driver only
# 4 Audi 100 Midsize 30.8 37.7 44.6 19 26 Driver & Passenger
# 5 BMW 535i Midsize 23.7 30.0 36.2 22 30 Driver only
# 6 Buick Century Midsize 14.2 15.7 17.3 22 31 Driver only
# DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity
# 1 Front 4 1.8 140 6300 2890 Yes 13.2
# 2 Front 6 3.2 200 5500 2335 Yes 18.0
# 3 Front 6 2.8 172 5500 2280 Yes 16.9
# 4 Front 6 2.8 172 5500 2535 Yes 21.1
# 5 Rear 4 3.5 208 5700 2545 Yes 21.1
# 6 Front 4 2.2 110 5200 2565 No 16.4
# Passengers Length Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin
# 1 5 177 102 68 37 26.5 11 2705 non-USA
# 2 5 195 115 71 38 30.0 15 3560 non-USA
# 3 5 180 102 67 37 28.0 14 3375 non-USA
# 4 6 193 106 70 37 31.0 17 3405 non-USA
# 5 4 186 109 69 39 27.0 13 3640 non-USA
# 6 6 189 105 69 41 28.0 16 2880 USA
# Make
# 1 Acura Integra
# 2 Acura Legend
# 3 Audi 90
# 4 Audi 100
# 5 BMW 535i
# 6 Buick Century
?Cars93
tapply(X = Cars93$Min.Price, INDEX = Cars93$Type, FUN = min)
# Compact Large Midsize Small Sporty Van
# 8.5 17.5 12.4 6.7 9.1 13.6
tapply(X = Cars93$Horsepower, INDEX = Cars93$Type, FUN = mean)
# Compact Large Midsize Small Sporty Van
# 131.0000 179.4545 173.0909 91.0000 160.1429 149.4444
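tapply() returns a named vector; if you would rather have the result as a data.frame, base R's aggregate() does the same grouping via a formula interface:

```r
library(MASS) # for the Cars93 dataset
# cheapest car per market segment, returned as a data.frame
aggregate(Min.Price ~ Type, data = Cars93, FUN = min)
```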
table(Cars93$Origin)
#
# USA non-USA
# 48 45
us_cars <- Cars93[Cars93$Origin == "USA", ]
nonus_cars <- Cars93[Cars93$Origin != "USA", ]
# write.csv(us_cars, file = "us_cars.csv")
# save(nonus_cars, file = "nonus_cars.RData")
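A related option (mentioned at the start of the course) is saveRDS()/readRDS(): it stores a single object, and unlike save()/load() you choose the name when reading it back. A minimal sketch, using iris rather than the exercise data:

```r
tmp <- tempfile(fileext = ".rds")
saveRDS(iris, file = tmp)  # one object per file
iris_back <- readRDS(tmp)  # you pick the name on reading
identical(iris, iris_back) # TRUE
```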
Many ways for the same task:
base graphics (plot)
ggplot2
lattice
interactive graphics with plotly, ggvis or other libraries
Why bother plotting at all?
The plot function
First thing: take a look at the overview documentation of plot
?plot
# Help on topic 'plot' was found in the following packages:
#
# Package Library
# base /Library/Frameworks/R.framework/Resources/library
# graphics /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/library
#
#
# Using the first match ...
We will see
plot parameters
Required:
Other options
main
xlab and ylab
xlim and ylim
pch, col and cex - as atomic elements or as vectors
mpg
library(ggplot2) # this is useful per se, and contains the dataset we will be using
?mpg
This dataset contains a subset of the fuel economy data that the EPA makes available on http://fueleconomy.gov
# works on RStudio
# View(mpg)
# otherwise stick to the classic
str(mpg)
# tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
# $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
# $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
# $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
# $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
# $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
# $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
# $ drv : chr [1:234] "f" "f" "f" "f" ...
# $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
# $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
# $ fl : chr [1:234] "p" "p" "p" "p" ...
# $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
Make a guess: what do you expect to see between fuel consumption and engine size?
plot(mpg$displ, mpg$cty)
Bonus: what is the correlation?
cor(mpg$displ, mpg$cty)
# [1] -0.798524
cor(mpg$displ, mpg$cty, method = "spearman")
# [1] -0.8809049
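cor() gives only the coefficient; cor.test() adds a p-value and a confidence interval. A sketch on the built-in iris data (the same call works for mpg$displ and mpg$cty):

```r
ct <- cor.test(iris$Sepal.Length, iris$Petal.Length)
ct$estimate # Pearson correlation, about 0.87
ct$p.value  # vanishingly small here
```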
mpg$mygroup <- as.numeric(factor(mpg$class))
plot(mpg$displ, mpg$cty,
col = mpg$mygroup
)
legend("topright", legend = levels(factor(mpg$class)), col = seq_along(levels(factor(mpg$class))), pch = 1)
plot(mpg$displ, mpg$cty,
pch = as.numeric(factor((mpg$class)))
)
This shows we have quite some overlap of points. What can we do?
Adding some jitter…
plot(
x = mpg$displ + rnorm(nrow(mpg), mean = 0, sd = 0.01),
y = mpg$cty + rnorm(nrow(mpg), mean = 0, sd = 0.01),
col = mpg$mygroup,
main = "now with jitter!"
)
Adding a smoothing line
Trying to see a pattern? Add a smoothing curve.
This one is wrong - missing the reordering of points
plot(mpg$displ, mpg$cty, col = mpg$mygroup)
myloess <- loess(cty ~ displ, data = mpg)
myfit <- fitted(myloess)
lines(mpg$displ, myfit)
legend("topright", legend = levels(factor(mpg$class)), col = seq_along(levels(factor(mpg$class))), pch = 1)
This one is correct!
plot(mpg$displ, mpg$cty, col = mpg$mygroup)
myloess <- loess(cty ~ displ, data = mpg)
myfit <- fitted(myloess)
myord <- order(mpg$displ)
lines(mpg$displ[myord], myfit[myord])
legend("topright", legend = levels(factor(mpg$class)), col = seq_along(levels(factor(mpg$class))), pch = 1)
lines can add (almost) anything (any line).
points works in a similar way to superimpose, well, points
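For instance, points() can overlay a subset on an existing plot to highlight it; a sketch using iris:

```r
plot(iris$Sepal.Length, iris$Petal.Length, col = "grey")
setosa <- iris[iris$Species == "setosa", ]
# re-draw just the setosa observations, filled and in red
points(setosa$Sepal.Length, setosa$Petal.Length, col = "red", pch = 19)
```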
?barplot
academia_levels <- table(surveyrbioc$Q2)
barplot(academia_levels)
How is the age distributed across academic levels? Check the help of boxplot
y ~ x variables - ok, it can get more complicated
boxplot(Q1 ~ Q2,
data = surveyrbioc
)
Splitting on more factors
boxplot(Q1 ~ Q2 + Q3,
data = surveyrbioc
)
Making it more readable…
boxplot(Q1 ~ Q2 + Q3,
data = surveyrbioc,
las = 2
)
Changing the parameters allows you to control many aspects of the plot appearance
par is your best friend - and enemy (see ?par)
par(mar = c(15, 3, 2, 2))
boxplot(Q1 ~ Q2 + Q3, data = surveyrbioc, las = 2)
par( ... ) has many arguments; here, the useful/most used ones
mar for handling the margins
cex, col, pch and co. are all parameters of par
las to change the style of the axis labels
mfrow to draw an array of figures
hist(surveyrbioc$Q1, breaks = 8)
hist(mpg$cty, breaks = 10)
hist(mpg$cty, breaks = 10, col = "steelblue")
hist(mpg$cty, breaks = 10, col = "steelblue", border = "gray")
hist(mpg$cty, breaks = 10, col = "steelblue", border = "gray", main = "Distribution of miles/gallon consumption in city traffic")
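To compare such variants side by side, mfrow from par() splits the device into a grid of panels; a sketch using iris, since the same idea applies to any base plot:

```r
old_par <- par(mfrow = c(1, 2)) # 1 row, 2 columns of panels
hist(iris$Sepal.Length, main = "sepal length")
hist(iris$Petal.Length, main = "petal length")
par(old_par)                    # restore the previous layout
```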
DON’T.
If you really need to do it…
?pie
example(pie) # especially the last one
#
# pie> require(grDevices)
#
# pie> pie(rep(1, 24), col = rainbow(24), radius = 0.9)
#
# pie> pie.sales <- c(0.12, 0.3, 0.26, 0.16, 0.04, 0.12)
#
# pie> names(pie.sales) <- c("Blueberry", "Cherry",
# pie+ "Apple", "Boston Cream", "Other", "Vanilla Cream")
#
# pie> pie(pie.sales) # default colours
#
# pie> pie(pie.sales, col = c("purple", "violetred1", "green3",
# pie+ "cornsilk", "cyan", "white"))
#
# pie> pie(pie.sales, col = gray(seq(0.4, 1.0, length.out = 6)))
#
# pie> pie(pie.sales, density = 10, angle = 15 + 10 * 1:6)
#
# pie> pie(pie.sales, clockwise = TRUE, main = "pie(*, clockwise = TRUE)")
#
# pie> segments(0, 0, 0, 1, col = "red", lwd = 2)
#
# pie> text(0, 1, "init.angle = 90", col = "red")
#
# pie> n <- 200
#
# pie> pie(rep(1, n), labels = "", col = rainbow(n), border = NA,
# pie+ main = "pie(*, labels=\"\", col=rainbow(n), border=NA,..")
#
# pie> ## Another case showing pie() is rather fun than science:
# pie> ## (original by FinalBackwardsGlance on http://imgur.com/gallery/wWrpU4X)
# pie> pie(c(Sky = 78, "Sunny side of pyramid" = 17, "Shady side of pyramid" = 5),
# pie+ init.angle = 315, col = c("deepskyblue", "yellow", "yellow3"), border = FALSE)
pie(c(20, 80),
init.angle = -40,
col = c("white", "yellow"),
labels = c("no pacman", "pacman"),
border = "lightgrey"
)
… or switch from pie to waffle (seriously)
DON’T. And this time I mean it
sadly enough there would be packages for this, too
a.k.a. Why is this bad?
age_by_group <- tapply(surveyrbioc$Q1, surveyrbioc$Q2, mean)
sd_by_group <- tapply(surveyrbioc$Q1, surveyrbioc$Q2, sd)
mybar <- barplot(age_by_group, col = c("khaki", "salmon", "firebrick"), ylim = c(0, max(age_by_group) + 5))
# mybar contains the bar midpoints - inspect it
arrows(mybar, age_by_group, mybar, (age_by_group + sd_by_group), length = 0.15, angle = 90)
Dynamite plots VS boxplots
boxplot(Q1 ~ Q2,
data = surveyrbioc
)
Median VS distribution VS actual points… What do you really want to show?
type in ?plot
log
text
expression
General code structure for this
opendevice()
...
code for the plot
...
closedevice()
pdf("myfilename.pdf")
# see also alternatives:
## png()
## jpeg()
plot(mpg$displ, mpg$cty,
col = mpg$mygroup
)
dev.off()
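One detail worth knowing: the pdf device stays open until dev.off(), and every plot drawn in between lands on a new page of the same file. A sketch:

```r
f <- tempfile(fileext = ".pdf")
pdf(f, width = 6, height = 4) # size in inches
hist(iris$Sepal.Length)       # page 1
hist(iris$Petal.Length)       # page 2 of the same file
dev.off()
file.exists(f) # TRUE
```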
Back to the iris. Three species are there. Explore the dataset in the following ways:
draw a histogram of the petal length. What do you see?
plot sepal length versus petal length. Add different colors to highlight the species
do the same for sepal width and sepal length, and this time use a different symbol for the species. Add a legend and a title if you want
(harder) calculate the mean values of each feature for each species, organizing it in a matrix where the rows are the species names. Generate a stacked bar plot with it, and another one where the bars are arranged horizontally
feel free to go back to the survey data and explore it further!
hist(iris$Petal.Length)
plot(iris$Sepal.Length, iris$Petal.Length)
plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species)
plot(iris$Sepal.Width, iris$Sepal.Length, pch = as.numeric(factor(iris$Species)))
legend("topright", legend = levels(factor(iris$Species)), pch = seq_along(levels(factor(iris$Species))))
sl_means <- tapply(iris$Sepal.Length, iris$Species, mean)
pl_means <- tapply(iris$Petal.Length, iris$Species, mean)
sw_means <- tapply(iris$Sepal.Width, iris$Species, mean)
pw_means <- tapply(iris$Petal.Width, iris$Species, mean)
mymat <- cbind(sl_means, pl_means, sw_means, pw_means)
barplot(mymat, legend.text = unique(iris$Species))
barplot(mymat, beside = TRUE, legend.text = unique(iris$Species))
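The horizontal-bars part of the exercise only needs the horiz argument of barplot():

```r
sl_means <- tapply(iris$Sepal.Length, iris$Species, mean)
pl_means <- tapply(iris$Petal.Length, iris$Species, mean)
sw_means <- tapply(iris$Sepal.Width, iris$Species, mean)
pw_means <- tapply(iris$Petal.Width, iris$Species, mean)
mymat <- cbind(sl_means, pl_means, sw_means, pw_means)
# horiz = TRUE flips the bars; las = 1 keeps the labels horizontal
barplot(mymat, horiz = TRUE, las = 1, legend.text = rownames(mymat))
```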
pairs(iris[, 1:4], col = iris$Species)
You can use the panels even more cleverly, check the help of pairs!
This is a collection of graphs in R - with the underlying code too.
ggplot2
But first, meet the gapminder data
library(gapminder)
head(gapminder)
# # A tibble: 6 × 6
# country continent year lifeExp pop gdpPercap
# <fct> <fct> <int> <dbl> <int> <dbl>
# 1 Afghanistan Asia 1952 28.8 8425333 779.
# 2 Afghanistan Asia 1957 30.3 9240934 821.
# 3 Afghanistan Asia 1962 32.0 10267083 853.
# 4 Afghanistan Asia 1967 34.0 11537966 836.
# 5 Afghanistan Asia 1972 36.1 13079460 740.
# 6 Afghanistan Asia 1977 38.4 14880372 786.
head(country_colors)
# Nigeria Egypt Ethiopia Congo, Dem. Rep. South Africa
# "#7F3B08" "#833D07" "#873F07" "#8B4107" "#8F4407"
# Sudan
# "#934607"
head(continent_colors)
# Africa Americas Asia Europe Oceania
# "#7F3B08" "#A50026" "#40004B" "#276419" "#313695"
Variables:
ggplot2 philosophy
gg stands for the grammar of graphics
data
aesthetics (shape, size, colour)
geoms to specify how you want to have the data plotted
statistical transformations
facets allow you to do quick elegant multi plots
It can come across somewhat harder since
yet, it makes the whole process of “thinking data” more natural.
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point()
We can store ggplot plot objects into a variable - and build upon that later
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp))
p + geom_point()
p + geom_point() + scale_x_log10()
p <- p + scale_x_log10()
p + geom_point(color = "blue")
p + geom_point(color = "steelblue", pch = 19, size = 8, alpha = 1 / 4)
p + geom_point(aes(color = continent))
p + geom_point(aes(col = continent), size = 4)
p + geom_point(aes(col = continent, size = pop))
p + geom_point(aes(col = continent, size = pop)) + geom_smooth()
niceone <- p + geom_point(aes(col = continent, size = pop)) +
geom_smooth(aes(col = continent), se = FALSE)
niceone
p + geom_point(aes(col = continent, size = pop)) +
geom_smooth(lwd = 2, se = FALSE, method = "lm", col = "red")
p + geom_point(aes(col = continent, size = pop)) +
geom_smooth(aes(col = continent), lwd = 2, se = FALSE, method = "lm")
p + geom_point() + facet_wrap(~continent)
p + geom_point(aes(col = continent)) + facet_wrap(~continent)
p + geom_point(aes(col = continent)) + geom_smooth() + facet_wrap(~continent)
Saving plots becomes de facto easier - ggsave works on "plot objects".
ggsave(file = "myplot.png")
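ggsave() infers the format from the file extension and saves the last plot displayed unless you pass one explicitly; width and height (in inches by default) are worth setting for reproducible figures. A sketch on the built-in mtcars data:

```r
library(ggplot2)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()
f <- tempfile(fileext = ".pdf") # format inferred from the extension
ggsave(filename = f, plot = p, width = 6, height = 4)
file.exists(f) # TRUE
```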
ggplot(
gapminder,
aes(x = year, y = lifeExp, group = country, color = country)
) +
geom_line(lwd = 1, show.legend = FALSE) +
scale_color_manual(values = country_colors) +
theme_bw() +
theme(strip.text = element_text(size = rel(1.1)))
bp <- ggplot(
gapminder,
aes(x = year, y = lifeExp, group = country, color = country)
) +
geom_line(lwd = 1, show.legend = FALSE) +
facet_wrap(~continent) +
scale_color_manual(values = country_colors) +
theme(strip.text = element_text(size = rel(1.1)))
bp
One step away from making things interactive - more interesting? more accessible? more explorable?
plotly::ggplotly(bp)
# now it is a categorical x VS continuous y
p <- ggplot(gapminder, aes(x = continent, y = lifeExp))
p + geom_point()
p + geom_point(alpha = 1 / 4)
It is so easy to escape dynamite plots!
p + geom_jitter()
p + geom_jitter(aes(col = continent))
p + geom_boxplot()
p + geom_boxplot() + geom_jitter(alpha = 1 / 2)
p <- ggplot(gapminder, aes(lifeExp))
p + geom_histogram()
p + geom_histogram(binwidth = 1)
Stacked histograms are much easier in this framework
p + geom_histogram(aes(color = continent))
p + geom_histogram(aes(fill = continent))
p + geom_histogram(aes(fill = continent), position = "identity")
… and so is the superimposing of more than one distribution
p + geom_histogram(aes(fill = continent), position = "identity", alpha = 0.4)
Similar to histogram, you can use also density plots
p + geom_density(aes(fill = continent), alpha = 1 / 4)
niceone + theme_bw()
niceone + theme_void()
If you really really really have to…
library("ggthemes")
niceone + theme_excel() + scale_color_excel()
try to recreate the plots you did with base graphics, this time using ggplot2
pick a nice plot you would like to have in your next manuscript: can you think of what you need to do it? I am talking of
Our aims:
R Markdown allows you to create documents that serve as a neat record of your analysis.
Why?
The key point is…
R Markdown documents present your code alongside its output (graphs, tables, etc.) with conventional text to explain it, a bit like a notebook. To do this, R Markdown uses markdown syntax.
Markdown is a very simple markup language which provides methods for creating documents with headers, images, links etc. from plain text files, while keeping the original plain text file easy to read.
You can convert Markdown documents to many other file types like .html or .pdf to display the headers, images etc..
It might sound complicated, but it really isn't.
First things first: install the required software
install.packages("rmarkdown")
library(rmarkdown)
knitr comes along, pandoc too. You should quickly be all set!
ABC here, let's go through it:
http://rmarkdown.rstudio.com/authoring_basics.html
plus… a beautiful cheat sheet is there for you!
http://rmarkdown.rstudio.com/lesson-15.html
http://www.rstudio.com/wp-content/uploads/2016/03/rmarkdown-cheatsheet-2.0.pdf
Using LaTeX? No problem, you can use \(\LaTeX\) here as well!
\(\left( f(x) = \sum_{i=0}^{n} \frac{a_i}{1+x} \right)\)
What can you do with R markdown?
The price to pay to have an Rmd document is sooooo small - and for that, you get
To create a new R Markdown file (.Rmd), select File -> New File -> R Markdown... in RStudio, then choose the file type you want to create.
The newly created .Rmd file comes with basic instructions but we want to create our own R Markdown script, so let’s get to know the different parts of an Rmd file
the YAML header, delimited by ---s
You can insert figures also like this

```r
n <- 10
rnorm(n)
# [1] -0.14014756 0.80533395 -0.52655265 -0.20807110 0.97189242 1.69874146 -0.99006732 0.39609724
# [9] 0.90447946 0.04883601
```
Shortcut: Ctrl + Alt + I
Input code: you can use multiple languages including R, Python, and SQL, many more (specify the language in the chunk options)
Inline code can be added with `r 1+1`
Detailed very nicely here: https://yihui.name/knitr/options/
A simple set of options which you can use for many documents:
set.seed(42)
knitr::opts_chunk$set(
comment = NA,
fig.align = "center",
fig.width = 7,
fig.height = 7,
warning = FALSE,
eval = TRUE
)
Use the Knit button in the RStudio IDE to render the file and preview the output with a single click or keyboard shortcut (Ctrl + Shift + K).
To generate a report from the file, run the render command (works also outside of RStudio):
library("rmarkdown")
rmarkdown::render("yourfile.Rmd")
It was a deep dive, but now…
You can do much much more (presentations, websites, manuscripts,…)
File -> New File -> R Markdown... in RStudio
output:
word_document
Made @IMBEI
Books
R in a nutshell, R cookbook, R graphics cookbook (@O’Reilly media)
A Beginner’s Guide to R (Zuur, Ieno, Meesters, @Springer)
R Programming for Data Science (Peng, @Leanpub)
R Programming for Bioinformatics (Gentleman, @CRC)
Bioconductor Case Studies (@Springer)
Data Analysis for the Life Sciences (Irizarry, Love, @Leanpub)
Bioconductor - An Introduction to Core Technologies (Hansen, @Leanpub)
R for Data Science (Wickham, Grolemund, @O’Reilly)
the whole Use R! book series: https://link.springer.com/bookseries/6991
this one from CRC: https://www.crcpress.com/Chapman--HallCRC-The-R-Series/book-series/CRCTHERSER
Courses
Misc
As a final lesson, do take note of and report the versions of R and of the packages you used for your analysis.
Why?
sessionInfo()
# R version 4.3.0 (2023-04-21)
# Platform: x86_64-apple-darwin20 (64-bit)
# Running under: macOS Monterey 12.7.1
#
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# time zone: Europe/Berlin
# tzcode source: internal
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] rmarkdown_2.25 ggthemes_5.0.0 gapminder_1.0.0 ggplot2_3.4.4 MASS_7.3-60.0.1
#
# loaded via a namespace (and not attached):
# [1] tidyr_1.3.1 plotly_4.10.4 rappdirs_0.3.3 sass_0.4.8 utf8_1.2.4
# [6] generics_0.1.3 icons_0.2.0 xml2_1.3.6 lattice_0.22-5 stringi_1.8.3
# [11] digest_0.6.34 magrittr_2.0.3 evaluate_0.23 grid_4.3.0 timechange_0.3.0
# [16] bookdown_0.37 fastmap_1.1.1 Matrix_1.6-5 jsonlite_1.8.8 formatR_1.14
# [21] httr_1.4.7 mgcv_1.9-1 purrr_1.0.2 fansi_1.0.6 crosstalk_1.2.1
# [26] viridisLite_0.4.2 scales_1.3.0 lazyeval_0.2.2 jquerylib_0.1.4 cli_3.6.2
# [31] rlang_1.1.3 crayon_1.5.2 ellipsis_0.3.2 splines_4.3.0 munsell_0.5.0
# [36] withr_3.0.0 cachem_1.0.8 yaml_2.3.8 tools_4.3.0 dplyr_1.1.4
# [41] colorspace_2.1-0 assertthat_0.2.1 vctrs_0.6.5 R6_2.5.1 lifecycle_1.0.4
# [46] lubridate_1.9.3 stringr_1.5.1 htmlwidgets_1.6.4 pkgconfig_2.0.3 bslib_0.6.1
# [51] pillar_1.9.0 gtable_0.3.4 data.table_1.15.0 glue_1.7.0 xfun_0.41
# [56] tibble_3.2.1 tidyselect_1.2.0 highr_0.10 emo_0.0.0.9000 rstudioapi_0.15.0
# [61] knitr_1.45 farver_2.1.1 nlme_3.1-164 htmltools_0.5.7 labeling_0.4.3
# [66] compiler_4.3.0